Abstract

We study a cohort of heart failure patients admitted to the CCU, with vital features selected from the MIMIC-III dataset. Several methods are used to predict heart failure mortality in the CCU, including logistic regression, SVM, decision tree, random forest, and boosting models. A preferred model is chosen from among these predictive models, evaluated by confusion matrices, ROC curves, and AUC. We also identify, from an inferential model, which features can be used to predict mortality in heart failure patients.

knitr::opts_chunk$set(warning=FALSE, message=FALSE, tidy = TRUE)
# Note: `setwd <- (...)` would assign the path to a variable named setwd; call the function instead.
setwd("/Users/jill/Documents/614 final assignment/")
library(ggplot2)
library(stats)
library(dplyr)
library(corrplot)
library(psych)
library(rgl)
library(pwr)
library(pscl)
library(ISLR)
library(dlookr)
library(Hmisc)
library(pastecs)
library(car)
library(lattice)
library(caret)
library(rpart.plot)
library(pROC)
library(ROCR)
library(randomForest)
library(kernlab)
library(e1071)
library(ROSE)

1. Perform data cleaning.

  1. Import the dataset
hf_df <- read.csv("final_dataset.csv", header = T, stringsAsFactors = F)
summary(hf_df)
##    subject_id       hadm_id       last_careunit           age       
##  Min.   :   26   Min.   :100199   Length:1335        Min.   :41.00  
##  1st Qu.:11581   1st Qu.:128783   Class :character   1st Qu.:63.00  
##  Median :24950   Median :156806   Mode  :character   Median :73.00  
##  Mean   :34944   Mean   :154118                      Mean   :71.01  
##  3rd Qu.:58048   3rd Qu.:179238                      3rd Qu.:81.00  
##  Max.   :99982   Max.   :199963                      Max.   :89.00  
##     status               los          glucose_num       
##  Length:1335        Min.   : 0.0012   Length:1335       
##  Class :character   1st Qu.: 1.2787   Class :character  
##  Mode  :character   Median : 2.2583   Mode  :character  
##                     Mean   : 3.3424                     
##                     3rd Qu.: 3.9614                     
##                     Max.   :52.8108                     
##   sodium_num        wbc_count_num      calcium_num       
##  Length:1335        Length:1335        Length:1335       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  hemoglobin_num     creatinine_num     urea_nitrogen_num 
##  Length:1335        Length:1335        Length:1335       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  chloride_num         pco2_num        
##  Length:1335        Length:1335       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
str(hf_df)
## 'data.frame':    1335 obs. of  15 variables:
##  $ subject_id       : int  26 42 55 77 163 176 199 209 214 228 ...
##  $ hadm_id          : int  197661 119203 190665 142768 138528 135828 185360 190711 197273 167764 ...
##  $ last_careunit    : chr  "CCU" "CCU" "CCU" "CCU" ...
##  $ age              : int  72 61 64 45 80 78 41 73 63 79 ...
##  $ status           : chr  "dead" "dead" "Alive" "Alive" ...
##  $ los              : num  2.14 1.9 1.91 1.16 1.53 ...
##  $ glucose_num      : chr  "129" "157" "106" "124" ...
##  $ sodium_num       : chr  "143" "139" "144" "139" ...
##  $ wbc_count_num    : chr  "8.2" "9.3" "11.6" "15.7" ...
##  $ calcium_num      : chr  "8.8" "9.2" "9.4" "8.9" ...
##  $ hemoglobin_num   : chr  "12.3" "12.5" "11.3" "13.3" ...
##  $ creatinine_num   : chr  "1.4" "0.7" "0.8" "1.1" ...
##  $ urea_nitrogen_num: chr  "36" "18" "23" "20" ...
##  $ chloride_num     : chr  "108" "105" "109" "102" ...
##  $ pco2_num         : chr  "NULL" "NULL" "NULL" "52" ...
  2. Data cleaning
hf <- hf_df
summary(hf)
##    subject_id       hadm_id       last_careunit           age       
##  Min.   :   26   Min.   :100199   Length:1335        Min.   :41.00  
##  1st Qu.:11581   1st Qu.:128783   Class :character   1st Qu.:63.00  
##  Median :24950   Median :156806   Mode  :character   Median :73.00  
##  Mean   :34944   Mean   :154118                      Mean   :71.01  
##  3rd Qu.:58048   3rd Qu.:179238                      3rd Qu.:81.00  
##  Max.   :99982   Max.   :199963                      Max.   :89.00  
##     status               los          glucose_num       
##  Length:1335        Min.   : 0.0012   Length:1335       
##  Class :character   1st Qu.: 1.2787   Class :character  
##  Mode  :character   Median : 2.2583   Mode  :character  
##                     Mean   : 3.3424                     
##                     3rd Qu.: 3.9614                     
##                     Max.   :52.8108                     
##   sodium_num        wbc_count_num      calcium_num       
##  Length:1335        Length:1335        Length:1335       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  hemoglobin_num     creatinine_num     urea_nitrogen_num 
##  Length:1335        Length:1335        Length:1335       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  chloride_num         pco2_num        
##  Length:1335        Length:1335       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
# Remove identifier variables that carry no predictive value for our model: subject_id, hadm_id, last_careunit.
hf <- hf[,-c(1,2,3)]

##  Change variables to appropriate types.
# status needs to be converted to a factor.
hf$status <- factor(hf$status, levels = c("Alive","dead"))
summary(hf$status)
## Alive  dead 
##   663   672
# glucose_num, sodium_num, wbc_count_num, calcium_num, hemoglobin_num, creatinine_num,
# urea_nitrogen_num, chloride_num and pco2_num should all be converted to numeric.
# Note: the "NULL" strings are coerced to NA in the process.
hf[,-c(1:3)] <- data.frame(lapply(hf[,-c(1:3)],as.numeric))
summary(hf)
##       age          status         los           glucose_num    
##  Min.   :41.00   Alive:663   Min.   : 0.0012   Min.   :  78.0  
##  1st Qu.:63.00   dead :672   1st Qu.: 1.2787   1st Qu.: 142.0  
##  Median :73.00               Median : 2.2583   Median : 181.0  
##  Mean   :71.01               Mean   : 3.3424   Mean   : 210.9  
##  3rd Qu.:81.00               3rd Qu.: 3.9614   3rd Qu.: 245.0  
##  Max.   :89.00               Max.   :52.8108   Max.   :1601.0  
##                                                NA's   :11      
##    sodium_num    wbc_count_num      calcium_num     hemoglobin_num 
##  Min.   :123.0   Min.   :  2.500   Min.   : 5.400   Min.   : 7.90  
##  1st Qu.:140.0   1st Qu.:  9.475   1st Qu.: 8.800   1st Qu.:11.30  
##  Median :142.0   Median : 12.600   Median : 9.200   Median :12.50  
##  Mean   :141.8   Mean   : 14.152   Mean   : 9.192   Mean   :12.68  
##  3rd Qu.:144.0   3rd Qu.: 17.000   3rd Qu.: 9.500   3rd Qu.:14.00  
##  Max.   :167.0   Max.   :170.300   Max.   :23.000   Max.   :19.60  
##  NA's   :11      NA's   :11        NA's   :30       NA's   :11     
##  creatinine_num   urea_nitrogen_num  chloride_num      pco2_num     
##  Min.   : 0.400   Min.   :  4.00    Min.   : 87.0   Min.   : 22.00  
##  1st Qu.: 1.100   1st Qu.: 23.00    1st Qu.:103.0   1st Qu.: 40.00  
##  Median : 1.500   Median : 35.00    Median :106.0   Median : 46.50  
##  Mean   : 2.186   Mean   : 44.81    Mean   :106.3   Mean   : 50.22  
##  3rd Qu.: 2.500   3rd Qu.: 58.00    3rd Qu.:109.0   3rd Qu.: 55.00  
##  Max.   :17.800   Max.   :272.00    Max.   :134.0   Max.   :175.00  
##  NA's   :10       NA's   :10        NA's   :11      NA's   :461
# As pco2_num has 461 NAs (over a third of the rows), drop this variable.
hf <- subset(hf,select = - pco2_num)

# Several variables still have a few NAs; these will be imputed after the data are split.
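Before moving on, the remaining missingness can be tallied per column (a minimal sketch, assuming the cleaned `hf` data frame from above):

```r
# Count remaining missing values in each column of hf
sapply(hf, function(x) sum(is.na(x)))
```

This makes it easy to confirm that each numeric variable has only a handful of NAs after dropping pco2_num.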

2. Display summary information on the data.

summary(hf)
##       age          status         los           glucose_num    
##  Min.   :41.00   Alive:663   Min.   : 0.0012   Min.   :  78.0  
##  1st Qu.:63.00   dead :672   1st Qu.: 1.2787   1st Qu.: 142.0  
##  Median :73.00               Median : 2.2583   Median : 181.0  
##  Mean   :71.01               Mean   : 3.3424   Mean   : 210.9  
##  3rd Qu.:81.00               3rd Qu.: 3.9614   3rd Qu.: 245.0  
##  Max.   :89.00               Max.   :52.8108   Max.   :1601.0  
##                                                NA's   :11      
##    sodium_num    wbc_count_num      calcium_num     hemoglobin_num 
##  Min.   :123.0   Min.   :  2.500   Min.   : 5.400   Min.   : 7.90  
##  1st Qu.:140.0   1st Qu.:  9.475   1st Qu.: 8.800   1st Qu.:11.30  
##  Median :142.0   Median : 12.600   Median : 9.200   Median :12.50  
##  Mean   :141.8   Mean   : 14.152   Mean   : 9.192   Mean   :12.68  
##  3rd Qu.:144.0   3rd Qu.: 17.000   3rd Qu.: 9.500   3rd Qu.:14.00  
##  Max.   :167.0   Max.   :170.300   Max.   :23.000   Max.   :19.60  
##  NA's   :11      NA's   :11        NA's   :30       NA's   :11     
##  creatinine_num   urea_nitrogen_num  chloride_num  
##  Min.   : 0.400   Min.   :  4.00    Min.   : 87.0  
##  1st Qu.: 1.100   1st Qu.: 23.00    1st Qu.:103.0  
##  Median : 1.500   Median : 35.00    Median :106.0  
##  Mean   : 2.186   Mean   : 44.81    Mean   :106.3  
##  3rd Qu.: 2.500   3rd Qu.: 58.00    3rd Qu.:109.0  
##  Max.   :17.800   Max.   :272.00    Max.   :134.0  
##  NA's   :10       NA's   :10        NA's   :11
lapply(hf[,-2], sd, na.rm = T)
## $age
## [1] 11.25485
## 
## $los
## [1] 3.577665
## 
## $glucose_num
## [1] 107.9483
## 
## $sodium_num
## [1] 3.890178
## 
## $wbc_count_num
## [1] 8.61156
## 
## $calcium_num
## [1] 0.8023754
## 
## $hemoglobin_num
## [1] 1.879497
## 
## $creatinine_num
## [1] 1.904505
## 
## $urea_nitrogen_num
## [1] 29.19859
## 
## $chloride_num
## [1] 5.234303

Summary

This dataset has 11 variables: 10 numerical and 1 categorical. We choose status as the response variable and the other 10 numerical variables as predictors.

3. Create visualizations of the distributions of key variables by the response variable. (Ex: colored by mortality).

Univariate

q <- ggplot(hf)
# Age distribution colored by mortality
q+geom_density(aes(x=age, fill=status), alpha = 0.8) +
  ggtitle("Distribution of Age (density)") + xlab("Age (year)") + ylab("Density")

q+ geom_histogram(aes(x=age, fill=status),bins = 20, position = "dodge") +
  ggtitle("Distribution of Age (histogram)") + xlab("Age (year)") + ylab("Frequency")

# Los distribution colored by mortality
q+geom_density(aes(x=los, fill=status), alpha = 0.8) +
  ggtitle("Distribution of Los (density)") + xlab("Los (day)") + ylab("Density")

q+ geom_histogram(aes(x=los, fill=status),bins = 10, position = "dodge") +
  ggtitle("Distribution of Los (histogram)") + xlab("Los (day)") + ylab("Frequency")

# Glucose_num distribution colored by mortality
q+geom_density(aes(x=glucose_num, fill=status), alpha = 0.8) +
  ggtitle("Distribution of glucose_num (density)") + xlab("Glucose (mg/dL)") + ylab("Density")

q+ geom_histogram(aes(x=glucose_num, fill=status),bins = 8, position = "dodge") +
  ggtitle("Distribution of glucose_num (histogram)") + xlab("Glucose (mg/dL)") + ylab("Frequency")

# Sodium_num distribution colored by mortality
q+geom_density(aes(x=sodium_num, fill=status), alpha = 0.8) +
  ggtitle("Distribution of Sodium_num (density)") + 
  xlab("Sodium_num (mmol/L)") + ylab("Density")

q+ geom_histogram(aes(x=sodium_num, fill=status),bins = 15, position = "dodge") +
  ggtitle("Distribution of Sodium_num (histogram)") + 
  xlab("Sodium_num (mmol/L)") + ylab("Frequency")

# Wbc_count_num distribution colored by mortality
q+geom_density(aes(x=wbc_count_num, fill=status), alpha = 0.8) +
  ggtitle("Distribution of Wbc_count_num (density)") + 
  xlab("Wbc_count_num (x10^3/uL)") + ylab("Density")

q+ geom_histogram(aes(x=wbc_count_num, fill=status),bins = 10, position = "dodge") +
  ggtitle("Distribution of Wbc_count_num (histogram)") + 
  xlab("Wbc_count_num (x10^3/uL)") + ylab("Frequency")

# Calcium_num distribution colored by mortality
q+geom_density(aes(x=calcium_num, fill=status), alpha = 0.8) +
  ggtitle("Distribution of Calcium_num (density)") + 
  xlab("Calcium_num (mg/dL)") + ylab("Density")

q+ geom_histogram(aes(x=calcium_num, fill=status),bins = 10, position = "dodge") +
  ggtitle("Distribution of Calcium_num (histogram)") + 
  xlab("Calcium_num (mg/dL)") + ylab("Frequency")

# Hemoglobin_num distribution colored by mortality
q+geom_density(aes(x=hemoglobin_num, fill=status), alpha = 0.8) +
  ggtitle("Distribution of Hemoglobin_num (density)") + 
  xlab("Hemoglobin_num (g/dL)") + ylab("Density")

q+ geom_histogram(aes(x=hemoglobin_num, fill=status),bins = 10, position = "dodge") +
  ggtitle("Distribution of Hemoglobin_num (histogram)") + 
  xlab("Hemoglobin_num (g/dL)") + ylab("Frequency")

# Creatinine_num distribution colored by mortality
q+geom_density(aes(x=creatinine_num, fill=status), alpha = 0.8) +
  ggtitle("Distribution of Creatinine_num (density)") + 
  xlab("Creatinine_num (mg/dL)") + ylab("Density")

q+ geom_histogram(aes(x=creatinine_num, fill=status),bins = 10, position = "dodge") +
  ggtitle("Distribution of Creatinine_num (histogram)") + 
  xlab("Creatinine_num (mg/dL)") + ylab("Frequency")

# Urea_nitrogen_num distribution colored by mortality
q+geom_density(aes(x=urea_nitrogen_num, fill=status), alpha = 0.8) +
  ggtitle("Distribution of Urea_nitrogen_num (density)") + 
  xlab("Urea_nitrogen_num (mg/dL)") + ylab("Density")

q+ geom_histogram(aes(x=urea_nitrogen_num, fill=status),bins = 10, position = "dodge") +
  ggtitle("Distribution of Urea_nitrogen_num (histogram)") + 
  xlab("Urea_nitrogen_num (mg/dL)") + ylab("Frequency")

# Chloride_num distribution colored by mortality
q+geom_density(aes(x=chloride_num, fill=status), alpha = 0.8) +
  ggtitle("Distribution of Chloride_num (density)") + 
  xlab("Chloride_num (mmol/L)") + ylab("Density")

q+ geom_histogram(aes(x=chloride_num, fill=status),bins = 10, position = "dodge") +
  ggtitle("Distribution of Chloride_num (histogram)") + 
  xlab("Chloride_num (mmol/L)") + ylab("Frequency")

Summary

Patients who died in the CCU were older on average than those who were alive at discharge. Length of stay is concentrated in the 0-10 day range, with surviving patients more frequent there. Patients who died had, on average, lower hemoglobin levels and higher glucose, creatinine, and urea nitrogen levels than survivors. The distributions of sodium_num, wbc_count_num, calcium_num and chloride_num are similar between the two groups.
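These visual impressions can be cross-checked numerically with group means by status (a sketch assuming the `hf` data frame from above; `across()` requires dplyr >= 1.0):

```r
library(dplyr)

# Mean of each numeric variable, split by survival status
hf %>%
  group_by(status) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```

A table of group means makes differences such as the higher urea nitrogen among non-survivors easy to confirm at a glance.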

4. Create visualizations of a couple of relationships you find interesting between variables (ex: scatter plot colored by mortality).

Correlation

# Correlation matrix
hf_num <- hf
hf_num$status <- as.numeric(hf_num$status)
c <- cor(hf_num, use = "pairwise.complete.obs", method = "spearman" )
corrplot(c)

# Visualize relationships between the pairs "sodium_num & chloride_num" and "creatinine_num & urea_nitrogen_num", colored by status.
q + geom_point(aes(x=sodium_num,y=chloride_num,col=status)) +
  ggtitle("Relationship between Sodium_num and Chloride_num") +
  xlab("Sodium_num (mmol/L)") + ylab("Chloride_num (mmol/L)")

q + geom_point(aes(x=creatinine_num,y=urea_nitrogen_num,col=status)) +
  ggtitle("Relationship between Creatinine_num and Urea_nitrogen_num") +
  xlab("Creatinine_num (mg/dL)") + ylab("Urea_nitrogen_num (mg/dL)")  

Summary

The correlation matrix shows that sodium_num is highly correlated with chloride_num, and creatinine_num with urea_nitrogen_num. The scatter plots confirm that these pairs have positive relationships. If the logistic model fits the data poorly, we may revisit the variables and remove one of each collinear pair.
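One way to quantify this collinearity before modeling is a variance inflation factor (VIF) check with the already-loaded car package (a sketch fitting a full logistic model on complete cases; VIF above roughly 5-10 is a common rule-of-thumb concern threshold):

```r
library(car)

# VIFs for all predictors in a full logistic model on complete cases
vif(glm(status ~ ., data = na.omit(hf), family = binomial))
```

Elevated VIFs for the sodium/chloride and creatinine/urea nitrogen pairs would support dropping one variable from each pair.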

5. Split your data into train and test sets.

# Split data
set.seed(3033)
intrain <- createDataPartition(y = hf$status, p= 0.7, list = FALSE)
training <- hf[intrain,]
testing <- hf[-intrain,]
dim(intrain); dim(training); dim(testing)
## [1] 936   1
## [1] 936  11
## [1] 399  11
# Impute NAs: the numeric variables each have only a few NAs, so impute them with the column mean.
# Training set
im_train <- training
summary(im_train)
##       age          status         los           glucose_num    
##  Min.   :41.00   Alive:465   Min.   : 0.0012   Min.   :  78.0  
##  1st Qu.:63.00   dead :471   1st Qu.: 1.2677   1st Qu.: 143.0  
##  Median :73.00               Median : 2.2338   Median : 183.0  
##  Mean   :71.01               Mean   : 3.3799   Mean   : 212.7  
##  3rd Qu.:81.00               3rd Qu.: 3.9679   3rd Qu.: 247.0  
##  Max.   :88.00               Max.   :52.8108   Max.   :1601.0  
##                                                NA's   :7       
##    sodium_num    wbc_count_num     calcium_num     hemoglobin_num 
##  Min.   :123.0   Min.   :  2.50   Min.   : 5.400   Min.   : 7.90  
##  1st Qu.:140.0   1st Qu.:  9.40   1st Qu.: 8.800   1st Qu.:11.20  
##  Median :142.0   Median : 12.60   Median : 9.200   Median :12.50  
##  Mean   :141.8   Mean   : 14.35   Mean   : 9.195   Mean   :12.65  
##  3rd Qu.:144.0   3rd Qu.: 17.20   3rd Qu.: 9.500   3rd Qu.:14.00  
##  Max.   :167.0   Max.   :170.30   Max.   :23.000   Max.   :18.70  
##  NA's   :7       NA's   :7        NA's   :19       NA's   :7      
##  creatinine_num   urea_nitrogen_num  chloride_num  
##  Min.   : 0.400   Min.   :  4.00    Min.   : 87.0  
##  1st Qu.: 1.100   1st Qu.: 23.00    1st Qu.:103.0  
##  Median : 1.500   Median : 35.00    Median :106.0  
##  Mean   : 2.208   Mean   : 44.75    Mean   :106.3  
##  3rd Qu.: 2.400   3rd Qu.: 59.00    3rd Qu.:110.0  
##  Max.   :17.800   Max.   :272.00    Max.   :134.0  
##  NA's   :6        NA's   :6         NA's   :7
im_train <- data.frame(lapply(im_train, function(x) { 
  if (is.numeric(x)){
     x[is.na(x)]<- mean(x,na.rm =T)
  }
  x
}))
summary(im_train)
##       age          status         los           glucose_num    
##  Min.   :41.00   Alive:465   Min.   : 0.0012   Min.   :  78.0  
##  1st Qu.:63.00   dead :471   1st Qu.: 1.2677   1st Qu.: 143.8  
##  Median :73.00               Median : 2.2338   Median : 183.0  
##  Mean   :71.01               Mean   : 3.3799   Mean   : 212.7  
##  3rd Qu.:81.00               3rd Qu.: 3.9679   3rd Qu.: 246.2  
##  Max.   :88.00               Max.   :52.8108   Max.   :1601.0  
##    sodium_num    wbc_count_num      calcium_num     hemoglobin_num 
##  Min.   :123.0   Min.   :  2.500   Min.   : 5.400   Min.   : 7.90  
##  1st Qu.:140.0   1st Qu.:  9.475   1st Qu.: 8.800   1st Qu.:11.20  
##  Median :142.0   Median : 12.600   Median : 9.195   Median :12.50  
##  Mean   :141.8   Mean   : 14.350   Mean   : 9.195   Mean   :12.65  
##  3rd Qu.:144.0   3rd Qu.: 17.200   3rd Qu.: 9.500   3rd Qu.:14.00  
##  Max.   :167.0   Max.   :170.300   Max.   :23.000   Max.   :18.70  
##  creatinine_num   urea_nitrogen_num  chloride_num  
##  Min.   : 0.400   Min.   :  4.00    Min.   : 87.0  
##  1st Qu.: 1.100   1st Qu.: 23.00    1st Qu.:103.0  
##  Median : 1.500   Median : 36.00    Median :106.0  
##  Mean   : 2.208   Mean   : 44.75    Mean   :106.3  
##  3rd Qu.: 2.400   3rd Qu.: 59.00    3rd Qu.:110.0  
##  Max.   :17.800   Max.   :272.00    Max.   :134.0
# Testing set
im_test <- testing
summary(im_test)
##       age       status         los           glucose_num   
##  Min.   :43   Alive:198   Min.   : 0.2203   Min.   : 94.0  
##  1st Qu.:63   dead :201   1st Qu.: 1.3042   1st Qu.:141.0  
##  Median :73               Median : 2.2819   Median :176.0  
##  Mean   :71               Mean   : 3.2546   Mean   :206.7  
##  3rd Qu.:81               3rd Qu.: 3.9510   3rd Qu.:239.5  
##  Max.   :89               Max.   :30.8874   Max.   :798.0  
##                                             NA's   :4      
##    sodium_num    wbc_count_num    calcium_num     hemoglobin_num 
##  Min.   :128.0   Min.   : 4.30   Min.   : 6.900   Min.   : 8.40  
##  1st Qu.:140.0   1st Qu.: 9.55   1st Qu.: 8.800   1st Qu.:11.30  
##  Median :142.0   Median :12.70   Median : 9.100   Median :12.50  
##  Mean   :141.8   Mean   :13.69   Mean   : 9.185   Mean   :12.75  
##  3rd Qu.:144.0   3rd Qu.:16.35   3rd Qu.: 9.500   3rd Qu.:14.10  
##  Max.   :155.0   Max.   :36.50   Max.   :14.100   Max.   :19.60  
##  NA's   :4       NA's   :4       NA's   :11       NA's   :4      
##  creatinine_num   urea_nitrogen_num  chloride_num  
##  Min.   : 0.400   Min.   :  8.00    Min.   : 91.0  
##  1st Qu.: 1.100   1st Qu.: 24.00    1st Qu.:103.0  
##  Median : 1.500   Median : 35.00    Median :106.0  
##  Mean   : 2.135   Mean   : 44.94    Mean   :106.2  
##  3rd Qu.: 2.500   3rd Qu.: 57.00    3rd Qu.:109.0  
##  Max.   :11.800   Max.   :156.00    Max.   :125.0  
##  NA's   :4        NA's   :4         NA's   :4
im_test <- data.frame(lapply(im_test, function(x) { 
  if (is.numeric(x)){
     x[is.na(x)]<- mean(x,na.rm =T)
  }
  x
}))
summary(im_test)
##       age       status         los           glucose_num   
##  Min.   :43   Alive:198   Min.   : 0.2203   Min.   : 94.0  
##  1st Qu.:63   dead :201   1st Qu.: 1.3042   1st Qu.:141.0  
##  Median :73               Median : 2.2819   Median :178.0  
##  Mean   :71               Mean   : 3.2546   Mean   :206.7  
##  3rd Qu.:81               3rd Qu.: 3.9510   3rd Qu.:238.5  
##  Max.   :89               Max.   :30.8874   Max.   :798.0  
##    sodium_num    wbc_count_num    calcium_num     hemoglobin_num 
##  Min.   :128.0   Min.   : 4.30   Min.   : 6.900   Min.   : 8.40  
##  1st Qu.:140.0   1st Qu.: 9.60   1st Qu.: 8.800   1st Qu.:11.30  
##  Median :142.0   Median :12.80   Median : 9.185   Median :12.50  
##  Mean   :141.8   Mean   :13.69   Mean   : 9.185   Mean   :12.75  
##  3rd Qu.:144.0   3rd Qu.:16.25   3rd Qu.: 9.500   3rd Qu.:14.10  
##  Max.   :155.0   Max.   :36.50   Max.   :14.100   Max.   :19.60  
##  creatinine_num   urea_nitrogen_num  chloride_num  
##  Min.   : 0.400   Min.   :  8.00    Min.   : 91.0  
##  1st Qu.: 1.100   1st Qu.: 24.00    1st Qu.:103.0  
##  Median : 1.500   Median : 36.00    Median :106.0  
##  Mean   : 2.135   Mean   : 44.94    Mean   :106.2  
##  3rd Qu.: 2.500   3rd Qu.: 57.00    3rd Qu.:109.0  
##  Max.   :11.800   Max.   :156.00    Max.   :125.0
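Note that the code above imputes the test set with the test set's own means. A common alternative, sketched below, fills both sets with means computed from the training data only, which avoids leaking test-set information into preprocessing (the names `num_cols`, `train_means` and `im_test2` are illustrative):

```r
# Means computed on the TRAINING set only
num_cols <- names(training)[sapply(training, is.numeric)]
train_means <- sapply(training[num_cols], mean, na.rm = TRUE)

# Apply training-set means to the test set's NAs
im_test2 <- testing
for (v in num_cols) {
  im_test2[[v]][is.na(im_test2[[v]])] <- train_means[[v]]
}
```

With only a few NAs per column, the difference here is likely small, but the training-mean approach generalizes safely to other preprocessing steps.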

6. Fit and evaluate a logistic regression model. Be sure to include regularization and evaluate with pseudo R2. Consider providing a plot to visualize relationships revealed by your model.

fit_main <- glm(status ~ ., data = im_train, family=binomial)
fit_null <- glm(status ~ 1, data = im_train, family=binomial)
fit_final <-step(fit_null, scope=list(lower=fit_null, upper=fit_main),direction="both")
## Start:  AIC=1299.53
## status ~ 1
## 
##                     Df Deviance    AIC
## + urea_nitrogen_num  1   1168.6 1172.6
## + age                1   1198.8 1202.8
## + creatinine_num     1   1235.3 1239.3
## + hemoglobin_num     1   1257.8 1261.8
## + wbc_count_num      1   1271.3 1275.3
## + glucose_num        1   1285.6 1289.6
## + los                1   1287.8 1291.8
## <none>                   1297.5 1299.5
## + sodium_num         1   1295.6 1299.6
## + calcium_num        1   1296.6 1300.6
## + chloride_num       1   1297.0 1301.0
## 
## Step:  AIC=1172.63
## status ~ urea_nitrogen_num
## 
##                     Df Deviance    AIC
## + age                1   1090.3 1096.3
## + wbc_count_num      1   1151.7 1157.7
## + hemoglobin_num     1   1152.8 1158.8
## <none>                   1168.6 1172.6
## + calcium_num        1   1167.1 1173.1
## + glucose_num        1   1167.5 1173.5
## + los                1   1167.8 1173.8
## + sodium_num         1   1168.4 1174.4
## + creatinine_num     1   1168.5 1174.5
## + chloride_num       1   1168.6 1174.6
## - urea_nitrogen_num  1   1297.5 1299.5
## 
## Step:  AIC=1096.31
## status ~ urea_nitrogen_num + age
## 
##                     Df Deviance    AIC
## + wbc_count_num      1   1069.5 1077.5
## + hemoglobin_num     1   1085.6 1093.6
## + los                1   1087.2 1095.2
## + glucose_num        1   1087.2 1095.2
## + creatinine_num     1   1087.3 1095.3
## <none>                   1090.3 1096.3
## + chloride_num       1   1089.9 1097.9
## + calcium_num        1   1090.2 1098.2
## + sodium_num         1   1090.3 1098.3
## - age                1   1168.6 1172.6
## - urea_nitrogen_num  1   1198.8 1202.8
## 
## Step:  AIC=1077.47
## status ~ urea_nitrogen_num + age + wbc_count_num
## 
##                     Df Deviance    AIC
## + hemoglobin_num     1   1061.6 1071.6
## + creatinine_num     1   1066.8 1076.8
## + chloride_num       1   1067.2 1077.2
## <none>                   1069.5 1077.5
## + sodium_num         1   1068.9 1078.9
## + glucose_num        1   1068.9 1078.9
## + los                1   1069.0 1079.0
## + calcium_num        1   1069.3 1079.3
## - wbc_count_num      1   1090.3 1096.3
## - age                1   1151.7 1157.7
## - urea_nitrogen_num  1   1167.1 1173.1
## 
## Step:  AIC=1071.64
## status ~ urea_nitrogen_num + age + wbc_count_num + hemoglobin_num
## 
##                     Df Deviance    AIC
## + chloride_num       1   1059.2 1071.2
## <none>                   1061.6 1071.6
## + creatinine_num     1   1060.0 1072.0
## + los                1   1060.9 1072.9
## + glucose_num        1   1061.1 1073.1
## + sodium_num         1   1061.3 1073.3
## + calcium_num        1   1061.6 1073.6
## - hemoglobin_num     1   1069.5 1077.5
## - wbc_count_num      1   1085.6 1093.6
## - age                1   1131.6 1139.6
## - urea_nitrogen_num  1   1143.1 1151.1
## 
## Step:  AIC=1071.21
## status ~ urea_nitrogen_num + age + wbc_count_num + hemoglobin_num + 
##     chloride_num
## 
##                     Df Deviance    AIC
## <none>                   1059.2 1071.2
## - chloride_num       1   1061.6 1071.6
## + los                1   1057.8 1071.8
## + creatinine_num     1   1058.0 1072.0
## + glucose_num        1   1058.7 1072.7
## + sodium_num         1   1058.8 1072.8
## + calcium_num        1   1059.2 1073.2
## - hemoglobin_num     1   1067.2 1077.2
## - wbc_count_num      1   1085.2 1095.2
## - age                1   1130.9 1140.9
## - urea_nitrogen_num  1   1136.7 1146.7
summary(fit_final) # chloride_num's p-value > 0.05, so it is not statistically significant; remove this variable.
## 
## Call:
## glm(formula = status ~ urea_nitrogen_num + age + wbc_count_num + 
##     hemoglobin_num + chloride_num, family = binomial, data = im_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.83477  -0.95015   0.08639   0.95255   2.15081  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.196087   1.703489  -1.289  0.19734    
## urea_nitrogen_num  0.025854   0.003225   8.017 1.08e-15 ***
## age                0.058539   0.007268   8.054 8.03e-16 ***
## wbc_count_num      0.061793   0.013705   4.509 6.52e-06 ***
## hemoglobin_num    -0.122085   0.043450  -2.810  0.00496 ** 
## chloride_num      -0.022570   0.014499  -1.557  0.11956    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1297.5  on 935  degrees of freedom
## Residual deviance: 1059.2  on 930  degrees of freedom
## AIC: 1071.2
## 
## Number of Fisher Scoring iterations: 5
fit_final2 <- glm(status ~ urea_nitrogen_num + age + wbc_count_num + 
    hemoglobin_num, data = im_train, family=binomial)
summary(fit_final2)
## 
## Call:
## glm(formula = status ~ urea_nitrogen_num + age + wbc_count_num + 
##     hemoglobin_num, family = binomial, data = im_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.78168  -0.96077   0.08109   0.96166   2.11101  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -4.508316   0.847339  -5.321 1.03e-07 ***
## urea_nitrogen_num  0.026390   0.003216   8.205 2.31e-16 ***
## age                0.057464   0.007210   7.970 1.58e-15 ***
## wbc_count_num      0.057886   0.013362   4.332 1.48e-05 ***
## hemoglobin_num    -0.120442   0.043326  -2.780  0.00544 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1297.5  on 935  degrees of freedom
## Residual deviance: 1061.6  on 931  degrees of freedom
## AIC: 1071.6
## 
## Number of Fisher Scoring iterations: 5
pR2(fit_final2)
##          llh      llhNull           G2     McFadden         r2ML 
## -530.8223568 -648.7665301  235.8883465    0.1817976    0.2227688 
##         r2CU 
##    0.2970292
# type = "response" returns predicted probabilities; without it, predict() returns
# log-odds, and the 0.5 cutoff below would be applied on the wrong scale.
pred_glm <- predict(fit_final2, im_test, type = "response")
df_compare_glm <- data.frame(pred_glm, im_test$status)
# May view dataframe df_compare_glm
plot(df_compare_glm)

test_pred <- ifelse(pred_glm >= 0.5, "dead", "Alive")
test_pred <- factor(test_pred, levels = c("Alive","dead"))
confusionMatrix(test_pred, im_test$status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive dead
##      Alive   167  101
##      dead     31  100
##                                           
##                Accuracy : 0.6692          
##                  95% CI : (0.6206, 0.7152)
##     No Information Rate : 0.5038          
##     P-Value [Acc > NIR] : 1.749e-11       
##                                           
##                   Kappa : 0.34            
##                                           
##  Mcnemar's Test P-Value : 1.905e-09       
##                                           
##             Sensitivity : 0.8434          
##             Specificity : 0.4975          
##          Pos Pred Value : 0.6231          
##          Neg Pred Value : 0.7634          
##              Prevalence : 0.4962          
##          Detection Rate : 0.4185          
##    Detection Prevalence : 0.6717          
##       Balanced Accuracy : 0.6705          
##                                           
##        'Positive' Class : Alive           
## 
# ROC curve
p_log <- predict(fit_final2, newdata=im_test, type="response")
pr_log <- prediction(p_log, im_test$status)
prf_log <- performance(pr_log, measure = "tpr", x.measure = "fpr")
par(pty = "s")
plot(prf_log, colorize = T, 
     main="ROC curve of logistic regression")
abline(a=0, b=1)
auc_log <- performance(pr_log, measure = "auc")
auc_log <- round(auc_log@y.values[[1]],5)
legend(.6,.2,auc_log,title="AUC",cex = .8)
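As a cross-check, the AUC can also be computed with the already-loaded pROC package (a sketch using `p_log` and `im_test` from above; `levels` gives the control and case classes, and `direction = "<"` says cases have the higher predicted values):

```r
# AUC via pROC, as a sanity check against the ROCR value
roc_log <- pROC::roc(im_test$status, p_log,
                     levels = c("Alive", "dead"), direction = "<")
pROC::auc(roc_log)
```

The two packages should agree on AUC up to rounding.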

Summary of logistic regression model

The final model includes 4 predictors, each with a p-value far below 0.05: urea_nitrogen_num, age, wbc_count_num and hemoglobin_num. The equation is: logit(P) = -4.51 + 0.026·urea_nitrogen_num + 0.057·age + 0.058·wbc_count_num - 0.12·hemoglobin_num

For 1 mg/dL increases in urea_nitrogen_num, the odds of mortality of heart failure patients is multiplied by 1.026 [exp(0.026)] on average, assuming other variables are held constant.

For 1 year increases in age, the odds of mortality of heart failure patients is multiplied by 1.059 [exp(0.057)] on average, assuming other variables are held constant.

For each 1 × 10³/µL increase in wbc_count_num, the odds of mortality for heart failure patients are multiplied by 1.060 [exp(0.058)] on average, assuming the other variables are held constant.

For each 1 g/dL increase in hemoglobin_num, the odds of mortality for heart failure patients are multiplied by 0.887 [exp(-0.12)] on average, assuming the other variables are held constant.
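The odds ratios quoted above can be checked directly; a minimal sketch using the rounded coefficients (in practice, exp(coef(fit_final2)) and exp(confint(fit_final2)) on the fitted model give the odds ratios with confidence intervals):

```r
# Odds ratios from the (rounded) logistic regression coefficients
coefs <- c(urea_nitrogen_num = 0.026, age = 0.057,
           wbc_count_num = 0.058, hemoglobin_num = -0.12)
round(exp(coefs), 3)
# 1.026, 1.059, 1.060 and 0.887 respectively
```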

In the pR2 results, McFadden's R² is 0.182, which indicates that this model does not fit the data well: it explains only about 18% of the variation in the response. Likely reasons are that the sample size is not large enough and that the predictors may not include all of the important determinants of mortality in heart failure patients. The plot of df_compare_glm confirms the weak fit: predictions are widely scattered relative to the true values.

The confusion matrix shows that this model correctly predicts 100 of the 201 patients who actually died, and 167 of the 198 who survived. Although the model identifies survivors well, predicting deaths is more important and meaningful for clinicians and healthcare.
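Since predicting deaths matters more here, one option is to lower the classification cut-off on P(dead); a sketch using p_log, the test-set probabilities computed for the ROC curve above (the 0.4 cut-off is an illustrative assumption, not a tuned value):

```r
# Lowering the 0.5 threshold on P(dead) trades specificity for
# sensitivity on the 'dead' class (cut-off value is illustrative)
cutoff <- 0.4
pred_class <- factor(ifelse(p_log > cutoff, "dead", "Alive"),
                     levels = c("Alive", "dead"))
table(Predicted = pred_class, Reference = im_test$status)
# pROC's coords(roc_object, "best", best.method = "youden") is one
# principled way to choose the cut-off instead
```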

7. Fit and evaluate an SVM classifier, trying linear, poly and RBF kernels. Be sure to tune hyperparameters so as to avoid underfitting and overfitting.

# Set up repeated 5-fold CV; centering/scaling is applied via preProcess in train()
trctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3,classProbs = TRUE,
                       summaryFunction = twoClassSummary)
set.seed(3233)

svm_linear <- train(status ~., data =im_train, method = "svmLinear",
                    trControl=trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10,
                    metric = "ROC")
svm_linear
## Support Vector Machines with Linear Kernel 
## 
## 936 samples
##  10 predictor
##   2 classes: 'Alive', 'dead' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 748, 749, 749, 749, 749, 749, ... 
## Resampling results:
## 
##   ROC        Sens     Spec     
##   0.7681942  0.69319  0.6801344
## 
## Tuning parameter 'C' was held constant at a value of 1
svm_linear$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 637 
## 
## Objective Function Value : -630.6096 
## Training error : 0.306624 
## Probability model included.
test_pred_sl<- predict(svm_linear, newdata = im_test)
confusionMatrix(test_pred_sl, im_test$status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive dead
##      Alive   136   58
##      dead     62  143
##                                           
##                Accuracy : 0.6992          
##                  95% CI : (0.6516, 0.7439)
##     No Information Rate : 0.5038          
##     P-Value [Acc > NIR] : 1.778e-15       
##                                           
##                   Kappa : 0.3984          
##                                           
##  Mcnemar's Test P-Value : 0.7842          
##                                           
##             Sensitivity : 0.6869          
##             Specificity : 0.7114          
##          Pos Pred Value : 0.7010          
##          Neg Pred Value : 0.6976          
##              Prevalence : 0.4962          
##          Detection Rate : 0.3409          
##    Detection Prevalence : 0.4862          
##       Balanced Accuracy : 0.6992          
##                                           
##        'Positive' Class : Alive           
## 
par(pty = "s")
test_pred_sl2 <- predict(svm_linear, newdata = im_test,type="prob")
roc(im_test$status,test_pred_sl2[,2],plot=T,legacy.axes = T, col="purple",
    main="ROC Curve of SVMLinear Model", xlab="FPR", ylab = "TPR", 
    print.auc = T, print.auc.x=0.4,print.auc.y=0.3)

## 
## Call:
## roc.default(response = im_test$status, predictor = test_pred_sl2[,     2], plot = T, legacy.axes = T, col = "purple", main = "ROC Curve of SVMLinear Model",     xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4,     print.auc.y = 0.3)
## 
## Data: test_pred_sl2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.7477
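Note that, as the output above shows, caret holds C fixed at 1 for method = "svmLinear" regardless of tuneLength. A sketch of how the cost parameter could be tuned explicitly (the grid values are illustrative assumptions):

```r
# Supply a grid explicitly so that C is actually tuned for svmLinear
grid_C <- expand.grid(C = c(0.01, 0.1, 0.5, 1, 2, 5))
svm_linear_tuned <- train(status ~ ., data = im_train,
                          method = "svmLinear",
                          trControl = trctrl,
                          preProcess = c("center", "scale"),
                          tuneGrid = grid_C,
                          metric = "ROC")
```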
#SVM Polynomial Model
set.seed(3233)
svm_poly <- train(status ~., data =im_train, method = "svmPoly",
                    trControl=trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 3,
                    metric = "ROC")

svm_poly
## Support Vector Machines with Polynomial Kernel 
## 
## 936 samples
##  10 predictor
##   2 classes: 'Alive', 'dead' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 748, 749, 749, 749, 749, 749, ... 
## Resampling results across tuning parameters:
## 
##   degree  scale  C     ROC        Sens       Spec     
##   1       0.001  0.25  0.7658440  0.9362007  0.3029041
##   1       0.001  0.50  0.7658516  0.9347670  0.2994102
##   1       0.001  1.00  0.7667078  0.7125448  0.6531616
##   1       0.010  0.25  0.7661719  0.6551971  0.7260918
##   1       0.010  0.50  0.7681771  0.6716846  0.7069802
##   1       0.010  1.00  0.7678975  0.6845878  0.6921463
##   1       0.100  0.25  0.7682457  0.6817204  0.6907428
##   1       0.100  0.50  0.7691454  0.6860215  0.6864875
##   1       0.100  1.00  0.7679275  0.6896057  0.6843897
##   2       0.001  0.25  0.7659888  0.9354839  0.3000971
##   2       0.001  0.50  0.7668291  0.7168459  0.6453453
##   2       0.001  1.00  0.7662854  0.6494624  0.7367301
##   2       0.010  0.25  0.7707176  0.6724014  0.7112430
##   2       0.010  0.50  0.7706477  0.6910394  0.6815230
##   2       0.010  1.00  0.7707136  0.6888889  0.6871892
##   2       0.100  0.25  0.7658659  0.6967742  0.6751474
##   2       0.100  0.50  0.7603310  0.6967742  0.6779993
##   2       0.100  1.00  0.7566650  0.6989247  0.6730422
##   3       0.001  0.25  0.7664133  0.9146953  0.3466891
##   3       0.001  0.50  0.7663531  0.6666667  0.7077044
##   3       0.001  1.00  0.7673892  0.6645161  0.7126316
##   3       0.010  0.25  0.7709464  0.6896057  0.6815155
##   3       0.010  0.50  0.7709002  0.6967742  0.6822396
##   3       0.010  1.00  0.7706426  0.6874552  0.6857783
##   3       0.100  0.25  0.7465041  0.6724014  0.6751922
##   3       0.100  0.50  0.7382985  0.6810036  0.6575215
##   3       0.100  1.00  0.7268847  0.6767025  0.6526017
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.01 and C
##  = 0.25.
svm_poly$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 0.25 
## 
## Polynomial kernel function. 
##  Hyperparameters : degree =  3  scale =  0.01  offset =  1 
## 
## Number of Support Vectors : 715 
## 
## Objective Function Value : -167.6995 
## Training error : 0.300214 
## Probability model included.
test_pred_sp <- predict(svm_poly, newdata = im_test)
confusionMatrix(test_pred_sp, im_test$status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive dead
##      Alive   140   65
##      dead     58  136
##                                           
##                Accuracy : 0.6917          
##                  95% CI : (0.6439, 0.7367)
##     No Information Rate : 0.5038          
##     P-Value [Acc > NIR] : 2.067e-14       
##                                           
##                   Kappa : 0.3836          
##                                           
##  Mcnemar's Test P-Value : 0.5885          
##                                           
##             Sensitivity : 0.7071          
##             Specificity : 0.6766          
##          Pos Pred Value : 0.6829          
##          Neg Pred Value : 0.7010          
##              Prevalence : 0.4962          
##          Detection Rate : 0.3509          
##    Detection Prevalence : 0.5138          
##       Balanced Accuracy : 0.6918          
##                                           
##        'Positive' Class : Alive           
## 
par(pty = "s")
test_pred_sp2 <- predict(svm_poly, newdata = im_test,type="prob")
roc(im_test$status,test_pred_sp2[,2],plot=T,legacy.axes = T, col="blue",
    main="ROC Curve of SVMPoly Model", xlab="FPR", ylab = "TPR", 
    print.auc = T, print.auc.x=0.4,print.auc.y=0.3)

## 
## Call:
## roc.default(response = im_test$status, predictor = test_pred_sp2[,     2], plot = T, legacy.axes = T, col = "blue", main = "ROC Curve of SVMPoly Model",     xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4,     print.auc.y = 0.3)
## 
## Data: test_pred_sp2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.7418
# SVM RBF Model
set.seed(3233)
svm_RBF <- train(status ~., data =im_train, method = "svmRadial",
                    trControl=trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10,
                    metric = "ROC")

svm_RBF
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 936 samples
##  10 predictor
##   2 classes: 'Alive', 'dead' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 748, 749, 749, 749, 749, 749, ... 
## Resampling results across tuning parameters:
## 
##   C       ROC        Sens       Spec     
##     0.25  0.7628504  0.6551971  0.7162150
##     0.50  0.7595249  0.6609319  0.7190444
##     1.00  0.7529735  0.6566308  0.7063233
##     2.00  0.7445889  0.6681004  0.6907951
##     4.00  0.7316517  0.6559140  0.6872340
##     8.00  0.7125140  0.6415771  0.6702426
##    16.00  0.6942941  0.6408602  0.6524897
##    32.00  0.6773674  0.6344086  0.6341172
##    64.00  0.6662330  0.6229391  0.6221202
##   128.00  0.6598725  0.6121864  0.6192982
## 
## Tuning parameter 'sigma' was held constant at a value of 0.1173631
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1173631 and C = 0.25.
svm_RBF$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 0.25 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.11736310351346 
## 
## Number of Support Vectors : 699 
## 
## Objective Function Value : -155.1019 
## Training error : 0.287393 
## Probability model included.
test_pred_sr <- predict(svm_RBF, newdata = im_test)
confusionMatrix(test_pred_sr, im_test$status)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive dead
##      Alive   132   63
##      dead     66  138
##                                           
##                Accuracy : 0.6767          
##                  95% CI : (0.6284, 0.7224)
##     No Information Rate : 0.5038          
##     P-Value [Acc > NIR] : 2.047e-12       
##                                           
##                   Kappa : 0.3533          
##                                           
##  Mcnemar's Test P-Value : 0.8602          
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.6866          
##          Pos Pred Value : 0.6769          
##          Neg Pred Value : 0.6765          
##              Prevalence : 0.4962          
##          Detection Rate : 0.3308          
##    Detection Prevalence : 0.4887          
##       Balanced Accuracy : 0.6766          
##                                           
##        'Positive' Class : Alive           
## 
par(pty = "s")
test_pred_sr2 <- predict(svm_RBF, newdata = im_test,type="prob")
roc(im_test$status,test_pred_sr2[,2],plot=T,legacy.axes = T, col="red",
    main="ROC Curve of SVMRBF Model", xlab="FPR", ylab = "TPR",
    print.auc = T, print.auc.x=0.4,print.auc.y=0.3)

## 
## Call:
## roc.default(response = im_test$status, predictor = test_pred_sr2[,     2], plot = T, legacy.axes = T, col = "red", main = "ROC Curve of SVMRBF Model",     xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4,     print.auc.y = 0.3)
## 
## Data: test_pred_sr2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.746

8. Fit and evaluate a decision tree. Be sure to tune hyperparameters.

set.seed(3233)
dtree_fit <- train(status ~., data = im_train, method = "rpart",
                   parms = list(split = "information"),
                   trControl=trctrl, 
                   tuneLength = 10,
                   metric = "ROC")

dtree_fit
## CART 
## 
## 936 samples
##  10 predictor
##   2 classes: 'Alive', 'dead' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 748, 749, 749, 749, 749, 749, ... 
## Resampling results across tuning parameters:
## 
##   cp           ROC        Sens       Spec     
##   0.001075269  0.7046784  0.6537634  0.6235685
##   0.002150538  0.7042373  0.6602151  0.6313475
##   0.003584229  0.7038417  0.6401434  0.6730123
##   0.005017921  0.7046679  0.6501792  0.6935424
##   0.005376344  0.7120532  0.6523297  0.7020157
##   0.006451613  0.7131921  0.6422939  0.7176334
##   0.007526882  0.7155879  0.6322581  0.7289586
##   0.008602151  0.7209940  0.6243728  0.7445166
##   0.032258065  0.7047283  0.7132616  0.6580814
##   0.363440860  0.6134690  0.5118280  0.7151101
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.008602151.
test_pred_dt <- predict(dtree_fit, newdata = im_test)
confusionMatrix(test_pred_dt, im_test$status )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive dead
##      Alive   109   39
##      dead     89  162
##                                           
##                Accuracy : 0.6792          
##                  95% CI : (0.6309, 0.7248)
##     No Information Rate : 0.5038          
##     P-Value [Acc > NIR] : 9.791e-13       
##                                           
##                   Kappa : 0.3571          
##                                           
##  Mcnemar's Test P-Value : 1.484e-05       
##                                           
##             Sensitivity : 0.5505          
##             Specificity : 0.8060          
##          Pos Pred Value : 0.7365          
##          Neg Pred Value : 0.6454          
##              Prevalence : 0.4962          
##          Detection Rate : 0.2732          
##    Detection Prevalence : 0.3709          
##       Balanced Accuracy : 0.6782          
##                                           
##        'Positive' Class : Alive           
## 
prp(dtree_fit$finalModel, box.palette = "Reds")

par(pty = "s")
test_pred_dt2 <- predict(dtree_fit, newdata = im_test,type="prob")
roc(im_test$status,test_pred_dt2[,2], plot=T,legacy.axes = T, col="orange",
    main="ROC Curve of Decision Tree Model", xlab="FPR", ylab = "TPR",
     print.auc = T, print.auc.x=0.4,print.auc.y=0.3)

## 
## Call:
## roc.default(response = im_test$status, predictor = test_pred_dt2[,     2], plot = T, legacy.axes = T, col = "orange", main = "ROC Curve of Decision Tree Model",     xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4,     print.auc.y = 0.3)
## 
## Data: test_pred_dt2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.7005

9. Fit a random forest model. Be sure to tune hyperparameters.

set.seed(3233)
rf_fit <- train(status~., data=im_train, method="rf", 
                preProcess = c("center", "scale"),
                trControl=trctrl, 
                tuneLength = 10,
                metric = "ROC")
## note: only 9 unique complexity parameters in default grid. Truncating the grid to 9 .
rf_fit
## Random Forest 
## 
## 936 samples
##  10 predictor
##   2 classes: 'Alive', 'dead' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 748, 749, 749, 749, 749, 749, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##    2    0.7659809  0.6666667  0.7077566
##    3    0.7653163  0.6695341  0.7027996
##    4    0.7635464  0.6645161  0.7042031
##    5    0.7649400  0.6702509  0.7084658
##    6    0.7607980  0.6645161  0.7055991
##    7    0.7608882  0.6716846  0.7013438
##    8    0.7605067  0.6695341  0.7034789
##    9    0.7603504  0.6709677  0.6942516
##   10    0.7593865  0.6716846  0.7063009
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
rf_fit$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 30.34%
## Confusion matrix:
##       Alive dead class.error
## Alive   308  157   0.3376344
## dead    127  344   0.2696391
test_pred_rf <- predict(rf_fit, newdata = im_test)
confusionMatrix(test_pred_rf, im_test$status )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive dead
##      Alive   126   50
##      dead     72  151
##                                           
##                Accuracy : 0.6942          
##                  95% CI : (0.6465, 0.7391)
##     No Information Rate : 0.5038          
##     P-Value [Acc > NIR] : 9.231e-15       
##                                           
##                   Kappa : 0.3879          
##                                           
##  Mcnemar's Test P-Value : 0.05727         
##                                           
##             Sensitivity : 0.6364          
##             Specificity : 0.7512          
##          Pos Pred Value : 0.7159          
##          Neg Pred Value : 0.6771          
##              Prevalence : 0.4962          
##          Detection Rate : 0.3158          
##    Detection Prevalence : 0.4411          
##       Balanced Accuracy : 0.6938          
##                                           
##        'Positive' Class : Alive           
## 
par(pty = "s")
test_pred_rf2 <- predict(rf_fit, newdata = im_test,type="prob")
roc(im_test$status,test_pred_rf2[,2], plot=T,legacy.axes = T, col="brown",
    main="ROC Curve of Random Forest Model", xlab="FPR", ylab = "TPR",
     print.auc = T, print.auc.x=0.4,print.auc.y=0.3)

## 
## Call:
## roc.default(response = im_test$status, predictor = test_pred_rf2[,     2], plot = T, legacy.axes = T, col = "brown", main = "ROC Curve of Random Forest Model",     xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4,     print.auc.y = 0.3)
## 
## Data: test_pred_rf2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.7596
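Since the abstract also asks which features predict mortality, caret's varImp() can rank the predictors the tuned forest relies on; a short sketch (rf_fit is the model trained above):

```r
# Rank predictors by random forest importance (scaled to 0-100)
rf_imp <- varImp(rf_fit, scale = TRUE)
rf_imp
plot(rf_imp, main = "Variable importance (random forest)")
```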

10. Fit an xgboost model. Be sure to tune hyperparameters.

set.seed(3233)
trctrl_xg <- trainControl(method = "repeatedcv", number = 4, repeats = 3,classProbs = TRUE,
                       summaryFunction = twoClassSummary)
xgb_fit <- train(status~., data=im_train, method="xgbTree", 
                preProcess = c("center", "scale"),
                trControl=trctrl_xg, 
                tuneLength = 4,
                metric = "ROC")
              
xgb_fit
## eXtreme Gradient Boosting 
## 
## 936 samples
##  10 predictor
##   2 classes: 'Alive', 'dead' 
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (4 fold, repeated 3 times) 
## Summary of sample sizes: 701, 702, 702, 703, 702, 702, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  ROC      
##   0.3  1          0.6               0.5000000   50      0.7771503
##   0.3  1          0.6               0.5000000  100      0.7691565
##   0.3  1          0.6               0.5000000  150      0.7611213
##   0.3  1          0.6               0.5000000  200      0.7584341
##   0.3  1          0.6               0.6666667   50      0.7773957
##   0.3  1          0.6               0.6666667  100      0.7730846
##   0.3  1          0.6               0.6666667  150      0.7679715
##   0.3  1          0.6               0.6666667  200      0.7601844
##   0.3  1          0.6               0.8333333   50      0.7821540
##   0.3  1          0.6               0.8333333  100      0.7727647
##   0.3  1          0.6               0.8333333  150      0.7693853
##   0.3  1          0.6               0.8333333  200      0.7645494
##   0.3  1          0.6               1.0000000   50      0.7841870
##   0.3  1          0.6               1.0000000  100      0.7789885
##   0.3  1          0.6               1.0000000  150      0.7735415
##   0.3  1          0.6               1.0000000  200      0.7717717
##   0.3  1          0.8               0.5000000   50      0.7720794
##   0.3  1          0.8               0.5000000  100      0.7644588
##   0.3  1          0.8               0.5000000  150      0.7595396
##   0.3  1          0.8               0.5000000  200      0.7573526
##   0.3  1          0.8               0.6666667   50      0.7774807
##   0.3  1          0.8               0.6666667  100      0.7699465
##   0.3  1          0.8               0.6666667  150      0.7618048
##   0.3  1          0.8               0.6666667  200      0.7593121
##   0.3  1          0.8               0.8333333   50      0.7810612
##   0.3  1          0.8               0.8333333  100      0.7750548
##   0.3  1          0.8               0.8333333  150      0.7673825
##   0.3  1          0.8               0.8333333  200      0.7629095
##   0.3  1          0.8               1.0000000   50      0.7847730
##   0.3  1          0.8               1.0000000  100      0.7788198
##   0.3  1          0.8               1.0000000  150      0.7752112
##   0.3  1          0.8               1.0000000  200      0.7724614
##   0.3  2          0.6               0.5000000   50      0.7552393
##   0.3  2          0.6               0.5000000  100      0.7461319
##   0.3  2          0.6               0.5000000  150      0.7399412
##   0.3  2          0.6               0.5000000  200      0.7380716
##   0.3  2          0.6               0.6666667   50      0.7636822
##   0.3  2          0.6               0.6666667  100      0.7506320
##   0.3  2          0.6               0.6666667  150      0.7451769
##   0.3  2          0.6               0.6666667  200      0.7395856
##   0.3  2          0.6               0.8333333   50      0.7702199
##   0.3  2          0.6               0.8333333  100      0.7550748
##   0.3  2          0.6               0.8333333  150      0.7445450
##   0.3  2          0.6               0.8333333  200      0.7400318
##   0.3  2          0.6               1.0000000   50      0.7701609
##   0.3  2          0.6               1.0000000  100      0.7561301
##   0.3  2          0.6               1.0000000  150      0.7489352
##   0.3  2          0.6               1.0000000  200      0.7448596
##   0.3  2          0.8               0.5000000   50      0.7558514
##   0.3  2          0.8               0.5000000  100      0.7403381
##   0.3  2          0.8               0.5000000  150      0.7327514
##   0.3  2          0.8               0.5000000  200      0.7252684
##   0.3  2          0.8               0.6666667   50      0.7579367
##   0.3  2          0.8               0.6666667  100      0.7537244
##   0.3  2          0.8               0.6666667  150      0.7453853
##   0.3  2          0.8               0.6666667  200      0.7415638
##   0.3  2          0.8               0.8333333   50      0.7671362
##   0.3  2          0.8               0.8333333  100      0.7533784
##   0.3  2          0.8               0.8333333  150      0.7447734
##   0.3  2          0.8               0.8333333  200      0.7393841
##   0.3  2          0.8               1.0000000   50      0.7727110
##   0.3  2          0.8               1.0000000  100      0.7556568
##   0.3  2          0.8               1.0000000  150      0.7479787
##   0.3  2          0.8               1.0000000  200      0.7420088
##   0.3  3          0.6               0.5000000   50      0.7480872
##   0.3  3          0.6               0.5000000  100      0.7376333
##   0.3  3          0.6               0.5000000  150      0.7357309
##   0.3  3          0.6               0.5000000  200      0.7308237
##   0.3  3          0.6               0.6666667   50      0.7473218
##   0.3  3          0.6               0.6666667  100      0.7402820
##   0.3  3          0.6               0.6666667  150      0.7378992
##   0.3  3          0.6               0.6666667  200      0.7347381
##   0.3  3          0.6               0.8333333   50      0.7576240
##   0.3  3          0.6               0.8333333  100      0.7487370
##   0.3  3          0.6               0.8333333  150      0.7443169
##   0.3  3          0.6               0.8333333  200      0.7416555
##   0.3  3          0.6               1.0000000   50      0.7642371
##   0.3  3          0.6               1.0000000  100      0.7483240
##   0.3  3          0.6               1.0000000  150      0.7450523
##   0.3  3          0.6               1.0000000  200      0.7437489
##   0.3  3          0.8               0.5000000   50      0.7481361
##   0.3  3          0.8               0.5000000  100      0.7372225
##   0.3  3          0.8               0.5000000  150      0.7314623
##   0.3  3          0.8               0.5000000  200      0.7291811
##   0.3  3          0.8               0.6666667   50      0.7517104
##   0.3  3          0.8               0.6666667  100      0.7418266
##   0.3  3          0.8               0.6666667  150      0.7372018
##   0.3  3          0.8               0.6666667  200      0.7342139
##   0.3  3          0.8               0.8333333   50      0.7571534
##   0.3  3          0.8               0.8333333  100      0.7485562
##   0.3  3          0.8               0.8333333  150      0.7438218
##   0.3  3          0.8               0.8333333  200      0.7402906
##   0.3  3          0.8               1.0000000   50      0.7621893
##   0.3  3          0.8               1.0000000  100      0.7474452
##   0.3  3          0.8               1.0000000  150      0.7437951
##   0.3  3          0.8               1.0000000  200      0.7418871
##   0.3  4          0.6               0.5000000   50      0.7494025
##   0.3  4          0.6               0.5000000  100      0.7408875
##   0.3  4          0.6               0.5000000  150      0.7389230
##   0.3  4          0.6               0.5000000  200      0.7404929
##   0.3  4          0.6               0.6666667   50      0.7524742
##   0.3  4          0.6               0.6666667  100      0.7457182
##   0.3  4          0.6               0.6666667  150      0.7423856
##   0.3  4          0.6               0.6666667  200      0.7397532
##   0.3  4          0.6               0.8333333   50      0.7505011
##   0.3  4          0.6               0.8333333  100      0.7392299
##   0.3  4          0.6               0.8333333  150      0.7389005
##   0.3  4          0.6               0.8333333  200      0.7355946
##   0.3  4          0.6               1.0000000   50      0.7524177
##   0.3  4          0.6               1.0000000  100      0.7458274
##   0.3  4          0.6               1.0000000  150      0.7430005
##   0.3  4          0.6               1.0000000  200      0.7395298
##   0.3  4          0.8               0.5000000   50      0.7421233
##   0.3  4          0.8               0.5000000  100      0.7333384
##   0.3  4          0.8               0.5000000  150      0.7332263
##   0.3  4          0.8               0.5000000  200      0.7317416
##   0.3  4          0.8               0.6666667   50      0.7372878
##   0.3  4          0.8               0.6666667  100      0.7342650
##   0.3  4          0.8               0.6666667  150      0.7343053
##   0.3  4          0.8               0.6666667  200      0.7312627
##   0.3  4          0.8               0.8333333   50      0.7500758
##   0.3  4          0.8               0.8333333  100      0.7456205
##   0.3  4          0.8               0.8333333  150      0.7405931
##   0.3  4          0.8               0.8333333  200      0.7390485
##   0.3  4          0.8               1.0000000   50      0.7514692
##   0.3  4          0.8               1.0000000  100      0.7450334
##   0.3  4          0.8               1.0000000  150      0.7430389
##   0.3  4          0.8               1.0000000  200      0.7416189
##   0.4  1          0.6               0.5000000   50      0.7722520
##   0.4  1          0.6               0.5000000  100      0.7655410
##   0.4  1          0.6               0.5000000  150      0.7586266
##   0.4  1          0.6               0.5000000  200      0.7521452
##   0.4  1          0.6               0.6666667   50      0.7727406
##   0.4  1          0.6               0.6666667  100      0.7635691
##   0.4  1          0.6               0.6666667  150      0.7592640
##   0.4  1          0.6               0.6666667  200      0.7556049
##   0.4  1          0.6               0.8333333   50      0.7755629
##   0.4  1          0.6               0.8333333  100      0.7647290
##   0.4  1          0.6               0.8333333  150      0.7620215
##   0.4  1          0.6               0.8333333  200      0.7565957
##   0.4  1          0.6               1.0000000   50      0.7807951
##   0.4  1          0.6               1.0000000  100      0.7731696
##   0.4  1          0.6               1.0000000  150      0.7679442
##   0.4  1          0.6               1.0000000  200      0.7658862
##   0.4  1          0.8               0.5000000   50      0.7690274
##   0.4  1          0.8               0.5000000  100      0.7563383
##   0.4  1          0.8               0.5000000  150      0.7464715
##   0.4  1          0.8               0.5000000  200      0.7425717
##   0.4  1          0.8               0.6666667   50      0.7753294
##   0.4  1          0.8               0.6666667  100      0.7641090
##   0.4  1          0.8               0.6666667  150      0.7582126
##   0.4  1          0.8               0.6666667  200      0.7496569
##   0.4  1          0.8               0.8333333   50      0.7772010
##   0.4  1          0.8               0.8333333  100      0.7667864
##   0.4  1          0.8               0.8333333  150      0.7592413
##   0.4  1          0.8               0.8333333  200      0.7563944
##   0.4  1          0.8               1.0000000   50      0.7826603
##   0.4  1          0.8               1.0000000  100      0.7741028
##   0.4  1          0.8               1.0000000  150      0.7716598
##   0.4  1          0.8               1.0000000  200      0.7674480
##   0.4  2          0.6               0.5000000   50      0.7488612
##   0.4  2          0.6               0.5000000  100      0.7439722
##   0.4  2          0.6               0.5000000  150      0.7362162
##   0.4  2          0.6               0.5000000  200      0.7300281
##   0.4  2          0.6               0.6666667   50      0.7645254
##   0.4  2          0.6               0.6666667  100      0.7527623
##   0.4  2          0.6               0.6666667  150      0.7397998
##   0.4  2          0.6               0.6666667  200      0.7370224
##   0.4  2          0.6               0.8333333   50      0.7593424
##   0.4  2          0.6               0.8333333  100      0.7460863
##   0.4  2          0.6               0.8333333  150      0.7392883
##   0.4  2          0.6               0.8333333  200      0.7345042
##   0.4  2          0.6               1.0000000   50      0.7686791
##   0.4  2          0.6               1.0000000  100      0.7509756
##   0.4  2          0.6               1.0000000  150      0.7427852
##   0.4  2          0.6               1.0000000  200      0.7377539
##   0.4  2          0.8               0.5000000   50      0.7553110
##   0.4  2          0.8               0.5000000  100      0.7436727
##   0.4  2          0.8               0.5000000  150      0.7457551
##   0.4  2          0.8               0.5000000  200      0.7329734
##   0.4  2          0.8               0.6666667   50      0.7533077
##   0.4  2          0.8               0.6666667  100      0.7452239
##   0.4  2          0.8               0.6666667  150      0.7397192
##   0.4  2          0.8               0.6666667  200      0.7356257
##   0.4  2          0.8               0.8333333   50      0.7585926
##   0.4  2          0.8               0.8333333  100      0.7414705
##   0.4  2          0.8               0.8333333  150      0.7386803
##   0.4  2          0.8               0.8333333  200      0.7371954
##   0.4  2          0.8               1.0000000   50      0.7662917
##   0.4  2          0.8               1.0000000  100      0.7485021
##   0.4  2          0.8               1.0000000  150      0.7412513
##   0.4  2          0.8               1.0000000  200      0.7362407
##   0.4  3          0.6               0.5000000   50      0.7408208
##   0.4  3          0.6               0.5000000  100      0.7280308
##   0.4  3          0.6               0.5000000  150      0.7284110
##   0.4  3          0.6               0.5000000  200      0.7302277
##   0.4  3          0.6               0.6666667   50      0.7384999
##   0.4  3          0.6               0.6666667  100      0.7329009
##   0.4  3          0.6               0.6666667  150      0.7348765
##   0.4  3          0.6               0.6666667  200      0.7333970
##   0.4  3          0.6               0.8333333   50      0.7516424
##   0.4  3          0.6               0.8333333  100      0.7435843
##   0.4  3          0.6               0.8333333  150      0.7358943
##   0.4  3          0.6               0.8333333  200      0.7314122
##   0.4  3          0.6               1.0000000   50      0.7559548
##   0.4  3          0.6               1.0000000  100      0.7437773
##   0.4  3          0.6               1.0000000  150      0.7401895
##   0.4  3          0.6               1.0000000  200      0.7365360
##   0.4  3          0.8               0.5000000   50      0.7353193
##   0.4  3          0.8               0.5000000  100      0.7248205
##   0.4  3          0.8               0.5000000  150      0.7257698
##   0.4  3          0.8               0.5000000  200      0.7246873
##   0.4  3          0.8               0.6666667   50      0.7445538
##   0.4  3          0.8               0.6666667  100      0.7380094
##   0.4  3          0.8               0.6666667  150      0.7375367
##   0.4  3          0.8               0.6666667  200      0.7366888
##   0.4  3          0.8               0.8333333   50      0.7458574
##   0.4  3          0.8               0.8333333  100      0.7385728
##   0.4  3          0.8               0.8333333  150      0.7350573
##   0.4  3          0.8               0.8333333  200      0.7349095
##   0.4  3          0.8               1.0000000   50      0.7557012
##   0.4  3          0.8               1.0000000  100      0.7425303
##   0.4  3          0.8               1.0000000  150      0.7396849
##   0.4  3          0.8               1.0000000  200      0.7379944
##   0.4  4          0.6               0.5000000   50      0.7334725
##   0.4  4          0.6               0.5000000  100      0.7336885
##   0.4  4          0.6               0.5000000  150      0.7316072
##   0.4  4          0.6               0.5000000  200      0.7313384
##   0.4  4          0.6               0.6666667   50      0.7392119
##   0.4  4          0.6               0.6666667  100      0.7301911
##   0.4  4          0.6               0.6666667  150      0.7301357
##   0.4  4          0.6               0.6666667  200      0.7293951
##   0.4  4          0.6               0.8333333   50      0.7409025
##   0.4  4          0.6               0.8333333  100      0.7375138
##   0.4  4          0.6               0.8333333  150      0.7350785
##   0.4  4          0.6               0.8333333  200      0.7326485
##   0.4  4          0.6               1.0000000   50      0.7469148
##   0.4  4          0.6               1.0000000  100      0.7429984
##   0.4  4          0.6               1.0000000  150      0.7411131
##   0.4  4          0.6               1.0000000  200      0.7393511
##   0.4  4          0.8               0.5000000   50      0.7304144
##   0.4  4          0.8               0.5000000  100      0.7275738
##   0.4  4          0.8               0.5000000  150      0.7241343
##   0.4  4          0.8               0.5000000  200      0.7229221
##   0.4  4          0.8               0.6666667   50      0.7365866
##   0.4  4          0.8               0.6666667  100      0.7319421
##   0.4  4          0.8               0.6666667  150      0.7296644
##   0.4  4          0.8               0.6666667  200      0.7266519
##   0.4  4          0.8               0.8333333   50      0.7439798
##   0.4  4          0.8               0.8333333  100      0.7399107
##   0.4  4          0.8               0.8333333  150      0.7377882
##   0.4  4          0.8               0.8333333  200      0.7358172
##   0.4  4          0.8               1.0000000   50      0.7528588
##   0.4  4          0.8               1.0000000  100      0.7482183
##   0.4  4          0.8               1.0000000  150      0.7437316
##   0.4  4          0.8               1.0000000  200      0.7406688
##   Sens       Spec     
##   0.6925042  0.6928872
##   0.6896245  0.6858733
##   0.6738874  0.6837064
##   0.6680973  0.6922353
##   0.6832142  0.7141279
##   0.6846264  0.7113212
##   0.6982759  0.6900442
##   0.6896490  0.6843583
##   0.6989636  0.6943298
##   0.6910797  0.6893500
##   0.6868000  0.6865312
##   0.6874939  0.6858190
##   0.7047291  0.6915170
##   0.6953900  0.6943177
##   0.6867755  0.6964665
##   0.6867693  0.6971546
##   0.6796223  0.7035407
##   0.6867877  0.7007219
##   0.6760303  0.6872435
##   0.6932533  0.6907745
##   0.6867693  0.6971425
##   0.6860571  0.6922111
##   0.6824958  0.6879557
##   0.6889061  0.6914566
##   0.6975452  0.6964243
##   0.6867570  0.6971305
##   0.6860571  0.6935934
##   0.6846387  0.6978427
##   0.7075781  0.6964363
##   0.6996942  0.6964303
##   0.6925165  0.6928932
##   0.6882000  0.6971305
##   0.6738444  0.6957060
##   0.6767241  0.6809117
##   0.6767303  0.6830122
##   0.6695587  0.6737831
##   0.6623686  0.6999976
##   0.6681096  0.6893923
##   0.6745321  0.6844367
##   0.6594950  0.6731131
##   0.6867325  0.6964484
##   0.6774487  0.6822879
##   0.6630931  0.6823060
##   0.6594889  0.6773565
##   0.6767241  0.7000157
##   0.6724261  0.6921930
##   0.6724384  0.6851007
##   0.6716770  0.6815998
##   0.6673851  0.7014040
##   0.6645790  0.6879014
##   0.6480131  0.6715981
##   0.6386740  0.6695338
##   0.6623563  0.6921809
##   0.6723708  0.6815575
##   0.6673114  0.6695821
##   0.6616072  0.6738314
##   0.6852957  0.7056352
##   0.6824467  0.6957362
##   0.6745628  0.6879557
##   0.6580583  0.6858371
##   0.6752935  0.7042892
##   0.6630931  0.6886619
##   0.6666482  0.6731071
##   0.6717015  0.6738314
##   0.6702709  0.6808815
##   0.6501682  0.6850826
##   0.6608704  0.6808392
##   0.6544417  0.6815575
##   0.6551847  0.6879436
##   0.6680543  0.6745135
##   0.6608950  0.6752197
##   0.6637440  0.6780506
##   0.6702955  0.6943298
##   0.6623256  0.6900563
##   0.6573153  0.6901227
##   0.6601766  0.6964725
##   0.6882491  0.6900563
##   0.6702648  0.6745075
##   0.6731138  0.6759078
##   0.6702525  0.6717007
##   0.6702648  0.6646446
##   0.6551847  0.6773685
##   0.6522620  0.6801632
##   0.6522804  0.6787568
##   0.6644992  0.6908047
##   0.6666605  0.6851731
##   0.6659421  0.6787870
##   0.6609380  0.6759621
##   0.6595257  0.7049290
##   0.6631054  0.6893681
##   0.6566215  0.6936477
##   0.6530172  0.6865373
##   0.6745751  0.6922172
##   0.6731138  0.6822698
##   0.6651992  0.6816119
##   0.6594582  0.6780506
##   0.6602073  0.6780687
##   0.6594521  0.6752801
##   0.6630502  0.6702762
##   0.6637563  0.6667210
##   0.6788240  0.6851188
##   0.6702279  0.6751835
##   0.6622888  0.6837245
##   0.6594521  0.6815817
##   0.6595319  0.7077720
##   0.6451763  0.6759078
##   0.6429905  0.6787447
##   0.6415414  0.6815636
##   0.6702648  0.6773263
##   0.6638177  0.6745075
##   0.6559092  0.6702943
##   0.6580644  0.6681334
##   0.6537049  0.6843703
##   0.6608704  0.6624294
##   0.6551233  0.6681213
##   0.6508191  0.6731011
##   0.6666482  0.6794449
##   0.6551663  0.6808453
##   0.6594643  0.6744954
##   0.6594582  0.6823000
##   0.6652176  0.6887102
##   0.6616441  0.6759501
##   0.6544909  0.6900744
##   0.6602257  0.6808755
##   0.6587951  0.6865373
##   0.6487253  0.6822939
##   0.6580337  0.6752076
##   0.6609073  0.6823060
##   0.6796345  0.7020679
##   0.6810222  0.6957543
##   0.6731076  0.7042409
##   0.6860018  0.6844307
##   0.6903367  0.6893621
##   0.6824835  0.6957603
##   0.6710261  0.6929234
##   0.6774425  0.6971546
##   0.7039616  0.6907504
##   0.6946532  0.6851067
##   0.6889061  0.6851067
##   0.6874816  0.6922232
##   0.6961207  0.7013859
##   0.6946593  0.6914747
##   0.6874939  0.6858190
##   0.6839142  0.6900804
##   0.6738751  0.7013738
##   0.6709893  0.6907745
##   0.6781179  0.6780204
##   0.6724138  0.6695338
##   0.6953900  0.6928811
##   0.6824713  0.6921387
##   0.6875246  0.6737711
##   0.6817467  0.6737771
##   0.6961145  0.6907745
##   0.6889245  0.6900744
##   0.6731567  0.6879436
##   0.6838896  0.6823000
##   0.7075842  0.6985489
##   0.6896490  0.6893380
##   0.6860509  0.6971365
##   0.6867570  0.6907745
##   0.6645852  0.6935692
##   0.6673851  0.6836762
##   0.6666175  0.6737409
##   0.6573214  0.6709824
##   0.6766996  0.7049109
##   0.6695341  0.6886921
##   0.6702218  0.6738254
##   0.6694850  0.6681395
##   0.6838896  0.6829881
##   0.6609441  0.6858069
##   0.6565908  0.6858190
##   0.6616502  0.6830122
##   0.6881693  0.6950179
##   0.6760057  0.6844246
##   0.6630870  0.6645963
##   0.6688464  0.6617775
##   0.6709954  0.6957120
##   0.6809976  0.6716343
##   0.6737953  0.6773142
##   0.6587398  0.6865252
##   0.6673728  0.6801693
##   0.6744953  0.6723586
##   0.6601643  0.6660087
##   0.6579723  0.6745135
##   0.6738874  0.6971305
##   0.6630931  0.6900985
##   0.6759566  0.6872676
##   0.6616257  0.6731252
##   0.6809915  0.6943177
##   0.6752812  0.6759440
##   0.6659053  0.6738133
##   0.6551417  0.6709945
##   0.6766689  0.6886559
##   0.6558540  0.6794389
##   0.6630440  0.6596347
##   0.6652176  0.6688396
##   0.6716770  0.6836581
##   0.6651808  0.6900381
##   0.6651746  0.6780144
##   0.6486946  0.6815455
##   0.6702341  0.6744592
##   0.6637747  0.6772961
##   0.6687911  0.6766201
##   0.6601704  0.6723767
##   0.6716831  0.6864709
##   0.6616502  0.6680550
##   0.6565969  0.6716222
##   0.6515559  0.6723586
##   0.6659053  0.6731373
##   0.6472701  0.6752318
##   0.6544233  0.6674574
##   0.6522497  0.6716645
##   0.6644869  0.6928992
##   0.6587644  0.6872374
##   0.6630686  0.6865493
##   0.6651869  0.6986214
##   0.6681034  0.6929113
##   0.6651746  0.6745014
##   0.6565785  0.6723888
##   0.6630317  0.6731131
##   0.6752751  0.6864829
##   0.6659544  0.6744471
##   0.6630931  0.6758837
##   0.6645238  0.6766140
##   0.6594643  0.6779782
##   0.6666605  0.6708979
##   0.6559031  0.6716041
##   0.6516050  0.6716283
##   0.6738322  0.6816300
##   0.6601950  0.6695760
##   0.6630440  0.6681213
##   0.6601520  0.6667331
##   0.6637440  0.6872676
##   0.6573214  0.6744954
##   0.6508743  0.6738012
##   0.6573214  0.6801994
##   0.6637992  0.6886861
##   0.6523357  0.6801813
##   0.6545031  0.6773444
##   0.6566460  0.6844307
##   0.6602134  0.6653508
##   0.6601397  0.6639082
##   0.6587091  0.6575523
##   0.6594152  0.6476472
##   0.6401169  0.6886438
##   0.6515743  0.6766201
##   0.6529865  0.6815636
##   0.6450781  0.6836762
##   0.6566031  0.6766744
##   0.6573153  0.6773565
##   0.6566031  0.6759501
##   0.6572969  0.6702822
##   0.6630870  0.6957724
##   0.6680789  0.6844428
##   0.6630563  0.6723888
##   0.6609073  0.6709764
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 1,
##  eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1
##  and subsample = 1.
xgb_fit$finalModel
## ##### xgb.Booster
## raw: 13.1 Kb 
## call:
##   xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth, 
##     gamma = param$gamma, colsample_bytree = param$colsample_bytree, 
##     min_child_weight = param$min_child_weight, subsample = param$subsample), 
##     data = x, nrounds = param$nrounds, objective = "binary:logistic")
## params (as set within xgb.train):
##   eta = "0.3", max_depth = "1", gamma = "0", colsample_bytree = "0.8", min_child_weight = "1", subsample = "1", objective = "binary:logistic", silent = "1"
## xgb.attributes:
##   niter
## callbacks:
##   cb.print.evaluation(period = print_every_n)
## # of features: 10 
## niter: 50
## nfeatures : 10 
## xNames : age los glucose_num sodium_num wbc_count_num calcium_num hemoglobin_num creatinine_num urea_nitrogen_num chloride_num 
## problemType : Classification 
## tuneValue :
##     nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 29      50         1 0.3     0              0.8                1         1
## obsLevels : Alive dead 
## param :
##  list()
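Beyond the tuning parameters, xgboost can report which of the ten features the final booster actually relies on. The sketch below trains a toy booster on synthetic data purely to illustrate the call; in this report, the fitted model would be passed in as `xgb.importance(model = xgb_fit$finalModel)`.

```r
library(xgboost)
# Toy booster on random data (synthetic, for illustration only).
set.seed(1)
x <- matrix(rnorm(200), ncol = 4,
            dimnames = list(NULL, c("age", "los", "glucose_num", "sodium_num")))
y <- rbinom(50, 1, 0.5)
bst <- xgboost(data = x, label = y, nrounds = 10,
               objective = "binary:logistic", verbose = 0)
imp <- xgb.importance(model = bst)  # Gain / Cover / Frequency per feature
xgb.plot.importance(imp)
```

In the report itself, replacing `bst` with `xgb_fit$finalModel` would rank the ten clinical features by their contribution to the boosted trees.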
test_pred_xgb <- predict(xgb_fit, newdata = im_test)
confusionMatrix(test_pred_xgb, im_test$status )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive dead
##      Alive   137   63
##      dead     61  138
##                                           
##                Accuracy : 0.6892          
##                  95% CI : (0.6413, 0.7343)
##     No Information Rate : 0.5038          
##     P-Value [Acc > NIR] : 4.575e-14       
##                                           
##                   Kappa : 0.3785          
##                                           
##  Mcnemar's Test P-Value : 0.9284          
##                                           
##             Sensitivity : 0.6919          
##             Specificity : 0.6866          
##          Pos Pred Value : 0.6850          
##          Neg Pred Value : 0.6935          
##              Prevalence : 0.4962          
##          Detection Rate : 0.3434          
##    Detection Prevalence : 0.5013          
##       Balanced Accuracy : 0.6892          
##                                           
##        'Positive' Class : Alive           
## 
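As a sanity check, the reported metrics can be recomputed by hand from the four confusion-matrix cells ("Alive" is the positive class, as in caret's output):

```r
# Cells from the confusion matrix above (positive class = "Alive")
TP <- 137; FN <- 61   # reference Alive: predicted Alive / dead
FP <- 63;  TN <- 138  # reference dead:  predicted Alive / dead

sensitivity <- TP / (TP + FN)                     # 137/198 = 0.6919
specificity <- TN / (TN + FP)                     # 138/201 = 0.6866
accuracy    <- (TP + TN) / (TP + FN + FP + TN)    # 275/399 = 0.6892
balanced    <- (sensitivity + specificity) / 2    # 0.6892
round(c(sensitivity, specificity, accuracy, balanced), 4)
```

These match caret's Sensitivity, Specificity, Accuracy and Balanced Accuracy exactly.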
par(pty = "s")
test_pred_xgb2 <- predict(xgb_fit, newdata = im_test,type="prob")
roc(im_test$status,test_pred_xgb2[,2], plot=T,legacy.axes = T, col="black",
    main="ROC Curve of XGBoost Model", xlab="FPR", ylab = "TPR",
     print.auc = T, print.auc.x=0.4,print.auc.y=0.3)

## 
## Call:
## roc.default(response = im_test$status, predictor = test_pred_xgb2[,     2], plot = T, legacy.axes = T, col = "black", main = "ROC Curve of XGBoost Model",     xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4,     print.auc.y = 0.3)
## 
## Data: test_pred_xgb2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.7531
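If a single operating threshold is needed rather than the whole curve, pROC's `coords()` can return the cutoff that maximizes Youden's J. The sketch below uses synthetic labels and scores for illustration; in the report, `im_test$status` and `test_pred_xgb2[, 2]` would take their place.

```r
library(pROC)
set.seed(1)
# Synthetic two-class example (for illustration only)
y <- factor(rep(c("Alive", "dead"), each = 50))
p <- c(rnorm(50, 0.4, 0.2), rnorm(50, 0.6, 0.2))
r <- roc(y, p)
# Threshold maximizing sensitivity + specificity - 1 (Youden's J)
coords(r, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```

This makes the implicit 0.5 cutoff used by `predict()` explicit and tunable.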

Multiple ROC curves of the different models.
Overlaying the curves from every model on one plot makes the comparison direct.

par(pty = "s")
roc.curve(im_test$status,p_log)
## Area under the curve (AUC): 0.748
roc.curve(im_test$status, test_pred_sl2[,2], add=TRUE, col=2, lwd=2, lty=2)
## Area under the curve (AUC): 0.747
roc.curve(im_test$status, test_pred_sp2[,2], add=TRUE, col=3, lwd=2, lty=3)
## Area under the curve (AUC): 0.742
roc.curve(im_test$status, test_pred_sr2[,2], add=TRUE, col=4, lwd=2, lty=4)
## Area under the curve (AUC): 0.746
roc.curve(im_test$status, test_pred_dt2[,2], add=TRUE, col=5, lwd=2, lty=5)
## Area under the curve (AUC): 0.700
roc.curve(im_test$status, test_pred_rf2[,2], add=TRUE, col=6, lwd=2, lty=6)
## Area under the curve (AUC): 0.761
roc.curve(im_test$status, test_pred_xgb2[,2], add=TRUE, col=7, lwd=2, lty=7)
## Area under the curve (AUC): 0.754
## Area under the curve (AUC): 0.754
legend("bottomright", 
       c("Logistic Regression", "SVM Linear", "SVM Poly", "SVM RBF", "Decision tree", 
         "Random Forest","XGBoost"),
       col=1:7, lty=1:7, lwd=2,cex = 0.5)

par(pty = "m")

10. Briefly summarize your findings, including a few sentences stating what your plots reveal and evaluating/comparing your models.

Evaluation of multiple models:

  Model                   AUC                 Sensitivity               Specificity
Logistic regression      0.748                   0.84                     0.50
SVM Linear               0.747                   0.69                     0.71
SVM Polynomial           0.742                   0.71                     0.68
SVM RBF                  0.746                   0.67                     0.69
Decision tree            0.700                   0.55                     0.80
Random forest            0.761                   0.63                     0.75
XGBoost                  0.754                   0.69                     0.69
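The same comparison can be assembled as a data frame and sorted by AUC (values transcribed from the table above):

```r
# Model comparison table, sorted by AUC (descending)
model_eval <- data.frame(
  Model = c("Logistic regression", "SVM Linear", "SVM Polynomial",
            "SVM RBF", "Decision tree", "Random forest", "XGBoost"),
  AUC         = c(0.748, 0.747, 0.742, 0.746, 0.700, 0.761, 0.754),
  Sensitivity = c(0.84, 0.69, 0.71, 0.67, 0.55, 0.63, 0.69),
  Specificity = c(0.50, 0.71, 0.68, 0.69, 0.80, 0.75, 0.69))
model_eval[order(-model_eval$AUC), ]  # random forest ranks first by AUC
```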

Although the decision tree has the lowest AUC of the models compared, its specificity of 0.80 means it correctly identifies 80% of the patients who died, which makes it practical for clinical use. The tree plot also indicates that age, glucose, wbc_count and urea_nitrogen are useful predictors of heart failure mortality, a result consistent with domain knowledge, and the tree diagram is easy to present and interpret. For these reasons I would choose the decision tree rather than logistic regression as the inference model.

As predictive models, random forest and XGBoost achieve similar AUC values, higher than those of the three SVM models, but an AUC of about 0.76 is still modest. The most likely reason is the limited number of observations (1,335 patients); machine learning methods generally need much larger datasets to reach better performance.